DROPS

Document

DOI: 10.4230/LIPIcs.CPM.2022.7

Beyond the Longest Letter-Duplicated Subsequence Problem

Authors: Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, and Peng Zou

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest letter-duplicated subsequence (LLDS) is proposed. Given a sequence S of length n, a letter-duplicated subsequence is a subsequence of S of the form x_1^{d_1} x_2^{d_2} ⋯ x_k^{d_k} with x_i ∈ Σ, x_j ≠ x_{j+1} and d_i ≥ 2 for all i in [k] and j in [k-1]. A linear time algorithm for computing the longest letter-duplicated subsequence (LLDS) of S can be easily obtained. In this paper, we focus on two variants of this problem. We first consider the constrained version when Σ is unbounded, each letter appears in S at least 6 times and all the letters in Σ must appear in the solution. We show that the problem is NP-hard (a further twist indicates that the problem does not admit any polynomial time approximation). The reduction is from possibly the simplest version of SAT that is NP-complete, (≤ 2, 1, ≤ 3)-SAT, where each variable appears at most twice positively and exactly once negatively, and each clause contains at most three literals and some clauses must contain exactly two literals. (We hope that this technique will serve as a general tool to help us prove NP-hardness for some trickier sequence problems involving only one sequence, which is much harder than with at least two input sequences; we apply it successfully at the end of the paper to some extra variations of the LLDS problem.) We then show that when each letter appears in S at most 3 times, the problem admits a factor 1.5-O(1/n) approximation. Finally, we consider the weighted version, where the weight of a block x_i^{d_i} (d_i ≥ 2) could be any positive function which might not grow with d_i. We give a non-trivial O(n²) time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of S whose weight is maximized.
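
To make the definition concrete, here is a minimal Python sketch. It is not the linear time algorithm the abstract mentions and not the authors' code: `is_letter_duplicated` and `llds_length` are hypothetical names, and the dynamic program is a plain quadratic-time illustration of the unweighted problem. It relies on the observation that two adjacent blocks of the same letter merge into one longer valid block, so the condition x_j ≠ x_{j+1} never has to be enforced when only the length is maximized.

```python
from itertools import groupby


def is_letter_duplicated(t):
    """True if every maximal run of equal letters in t has length >= 2."""
    return all(len(list(run)) >= 2 for _, run in groupby(t))


def llds_length(s):
    """Length of a longest letter-duplicated subsequence of s.

    Illustrative O(n^2) dynamic program (an assumption of this sketch,
    not the linear time method referred to in the abstract).
    """
    n = len(s)
    best = [0] * (n + 1)  # best[i] = optimum for the prefix s[:i]
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # case 1: s[i-1] is not used
        letter = s[i - 1]      # case 2: the last block ends at s[i-1]
        count = 0
        for j in range(i - 1, -1, -1):
            if s[j] == letter:
                count += 1
            if count >= 2:     # block = every `letter` inside s[j:i]
                best[i] = max(best[i], best[j] + count)
    return best[n]
```

For instance, `llds_length("aabbaa")` uses the whole string, while `"abab"` only admits `aa` or `bb`.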

Cite as

Wenfeng Lai, Adiesha Liyanage, Binhai Zhu, and Peng Zou. Beyond the Longest Letter-Duplicated Subsequence Problem. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 7:1-7:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{lai_et_al:LIPIcs.CPM.2022.7,
  author =	{Lai, Wenfeng and Liyanage, Adiesha and Zhu, Binhai and Zou, Peng},
  title =	{{Beyond the Longest Letter-Duplicated Subsequence Problem}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{7:1--7:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.7},
  URN =		{urn:nbn:de:0030-drops-161348},
  doi =		{10.4230/LIPIcs.CPM.2022.7},
  annote =	{Keywords: Segmental duplications, Tandem duplications, Longest common subsequence, NP-completeness, Dynamic programming}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.22

Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

Authors: Manuel Lafond, Binhai Zhu, and Peng Zou

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

Recently, due to genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNP for short) have become increasingly popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific segment of interest. The motivation is that in the late stage of certain types of cancer, the genomes are progressing rapidly by segmental duplications and deletions, and hence obtaining the exact sequences becomes difficult. Instead, the number of copies of important segments can be predicted from expression analysis and carries important biological information. Therefore, significant research has recently been devoted to the analysis of genomic data represented as CNPs. In this paper, we present two streams of results. The first consists of negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. The Minimum Copy Number Generation (MCNG) problem is defined as follows: given a string S in which each character represents a gene or segment, and a CNP C, compute a string T from S, with the minimum number of segmental duplications and deletions, such that cnp(T)=C. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem, and they left open the question of whether the problem remains NP-hard if arbitrary duplications and/or deletions are used. We answer this question affirmatively in this paper; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. This is achieved through a general-purpose lemma on set-cover reductions that require an exact cover in one direction, but not the other, which might be of independent interest. We also prove that the corresponding parameterized version is W[1]-hard, answering another open question by Qingge et al. The other result is positive and is based on a new (and more general) problem regarding CNPs.
The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs C₁ and C₂, compute two strings S₁ and S₂ with cnp(S₁)=C₁ and cnp(S₂)=C₂ such that the distance between S₁ and S₂, d(S₁,S₂), is minimized. Here, d(S₁,S₂) is a very general term: it could be any genome rearrangement distance (such as reversal, transposition, or tandem duplication). We make the first step by showing that if d(S₁,S₂) is measured by the breakpoint distance then the problem is polynomially solvable. We expect that this will trigger related research along this line in the near future.
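
As a small illustration of the cnp(·) notation, the following Python sketch computes the copy number profile of a string and applies one segmental duplication and one deletion. The function names, the representation of genes as single characters, and the fixed gene order `genes` are assumptions made for the example, not definitions taken from the paper.

```python
from collections import Counter


def cnp(s, genes):
    """Copy number profile of s: number of copies of each gene, in the order of `genes`."""
    counts = Counter(s)
    return [counts[g] for g in genes]


def duplicate_segment(s, i, j, k):
    """Copy the segment s[i:j] and insert the copy at position k (arbitrary duplication)."""
    return s[:k] + s[i:j] + s[k:]


def delete_segment(s, i, j):
    """Remove the segment s[i:j]."""
    return s[:i] + s[j:]
```

With S = `"abc"` and target profile C = [1, 2, 2] over the gene order `"abc"`, a single duplication of the segment `"bc"` already yields a string T with cnp(T) = C.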

Cite as

Manuel Lafond, Binhai Zhu, and Peng Zou. Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{lafond_et_al:LIPIcs.CPM.2020.22,
  author =	{Lafond, Manuel and Zhu, Binhai and Zou, Peng},
  title =	{{Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{22:1--22:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.22},
  URN =		{urn:nbn:de:0030-drops-121471},
  doi =		{10.4230/LIPIcs.CPM.2020.22},
  annote =	{Keywords: Computational genomics, cancer genomics, copy number profiles, NP-hardness, approximation algorithms, FPT algorithms}
}

Document

DOI: 10.4230/LIPIcs.STACS.2020.15

The Tandem Duplication Distance Is NP-Hard

Authors: Manuel Lafond, Binhai Zhu, and Peng Zou

Published in: LIPIcs, Volume 154, 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)

Abstract

In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment; this can be represented as the string operation AXB ⇒ AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been widely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the shortest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year-old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.
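
The string operation AXB ⇒ AXXB and the resulting distance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `tandem_duplicate` and `td_distance` are hypothetical names, and the breadth-first search is an exhaustive method that is practical only for very short strings, which is consistent with the NP-hardness result rather than a way around it.

```python
from collections import deque


def tandem_duplicate(s, i, j):
    """Apply A X B -> A X X B, where X = s[i:j]."""
    return s[:j] + s[i:j] + s[j:]


def td_distance(source, target):
    """Fewest tandem duplications turning source into target, or None if impossible.

    Exhaustive breadth-first search; exponential in general.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, dist = queue.popleft()
        n = len(current)
        for i in range(n):
            for j in range(i + 1, n + 1):
                if n + (j - i) > len(target):
                    break  # duplications only lengthen the string
                nxt = tandem_duplicate(current, i, j)
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None
```

For example, `"ab"` reaches `"abab"` in one step (duplicate the whole string) and `"aabb"` in two (duplicate `a`, then `b`), while `"ba"` is unreachable.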

Cite as

Manuel Lafond, Binhai Zhu, and Peng Zou. The Tandem Duplication Distance Is NP-Hard. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{lafond_et_al:LIPIcs.STACS.2020.15,
  author =	{Lafond, Manuel and Zhu, Binhai and Zou, Peng},
  title =	{{The Tandem Duplication Distance Is NP-Hard}},
  booktitle =	{37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-140-5},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{154},
  editor =	{Paul, Christophe and Bl{\"a}ser, Markus},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2020.15},
  URN =		{urn:nbn:de:0030-drops-118769},
  doi =		{10.4230/LIPIcs.STACS.2020.15},
  annote =	{Keywords: Tandem duplication, Text processing, Formal languages, Computational genomics, FPT algorithms}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.5

A 2-Approximation Algorithm for the Complementary Maximal Strip Recovery Problem

Authors: Haitao Jiang, Jiong Guo, Daming Zhu, and Binhai Zhu

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

The Maximal Strip Recovery problem (MSR) and its complementary problem (CMSR) are well-studied NP-hard problems in computational genomics. The input to these dual problems is a pair of signed permutations. The goal is to delete some gene markers from both permutations, such that, in the remaining permutations, each gene marker has at least one common neighbor. Equivalently, the resulting permutations can be partitioned into common strips of length at least two. The objective of MSR is to maximize the number of remaining gene markers, while the objective of CMSR is to minimize the number of deleted gene markers. In this paper, we present a new approximation algorithm for the Complementary Maximal Strip Recovery (CMSR) problem. Our approximation factor is 2, improving the currently best 7/3-approximation algorithm. Although the improvement on the factor is not huge, the analysis is greatly simplified by a compensating method, commonly referred to as the non-oblivious local search technique. In such a method a substitution may not always increase the value of the current solution (it sometimes may even decrease the solution value), though it always improves the value of another function seemingly unrelated to the objective function.
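
The feasibility condition (every remaining marker has a common neighbor) can be checked with a short Python sketch. The encoding is an assumption made for illustration: a signed permutation is a list of non-zero integers, and an adjacency (a, b) of one permutation counts as common if the other contains a, b consecutively or, reversed and negated, -b, -a consecutively. The function names are hypothetical and this is only a validity check, not the approximation algorithm of the paper.

```python
def common_adjacencies(p1, p2):
    """Indices i such that (p1[i], p1[i+1]) also occurs in p2, directly or reversed and negated."""
    in_p2 = set()
    for x, y in zip(p2, p2[1:]):
        in_p2.add((x, y))
        in_p2.add((-y, -x))  # the same adjacency read on the opposite strand
    return [i for i, pair in enumerate(zip(p1, p1[1:])) if pair in in_p2]


def every_marker_in_a_strip(p1, p2):
    """True if each marker of p1 lies in a common strip of length at least two.

    Assumes p1 and p2 contain the same markers up to sign.
    """
    covered = set()
    for i in common_adjacencies(p1, p2):
        covered.update((i, i + 1))
    return len(covered) == len(p1)
```

With p1 = [1, 2, 3, 4] and p2 = [-2, -1, 3, 4], the strips are (1, 2) and (3, 4), so no deletion is needed; with p1 = [1, 2, 3] and p2 = [1, 3, 2] no adjacency is shared.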

Cite as

Haitao Jiang, Jiong Guo, Daming Zhu, and Binhai Zhu. A 2-Approximation Algorithm for the Complementary Maximal Strip Recovery Problem. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 5:1-5:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{jiang_et_al:LIPIcs.CPM.2019.5,
  author =	{Jiang, Haitao and Guo, Jiong and Zhu, Daming and Zhu, Binhai},
  title =	{{A 2-Approximation Algorithm for the Complementary Maximal Strip Recovery Problem}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{5:1--5:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and Pissis, Solon P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.5},
  URN =		{urn:nbn:de:0030-drops-104769},
  doi =		{10.4230/LIPIcs.CPM.2019.5},
  annote =	{Keywords: Maximal strip recovery, complementary maximal strip recovery, computational genomics, approximation algorithm, local search}
}

Document

Complete Volume

DOI: 10.4230/LIPIcs.CPM.2018

LIPIcs, Volume 105, CPM'18, Complete Volume

Authors: Gonzalo Navarro, David Sankoff, and Binhai Zhu

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Abstract

LIPIcs, Volume 105, CPM'18, Complete Volume

Cite as

29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@Proceedings{navarro_et_al:LIPIcs.CPM.2018,
  title =	{{LIPIcs, Volume 105, CPM'18, Complete Volume}},
  booktitle =	{29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-074-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{105},
  editor =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018},
  URN =		{urn:nbn:de:0030-drops-89341},
  doi =		{10.4230/LIPIcs.CPM.2018},
  annote =	{Keywords: Mathematics of computing, Discrete mathematics, Information theory, Information systems, Information retrieval, Theory of computation}
}

Document

Front Matter

DOI: 10.4230/LIPIcs.CPM.2018.0

Front Matter, Table of Contents, Preface, Conference Organization

Authors: Gonzalo Navarro, David Sankoff, and Binhai Zhu

Published in: LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

Abstract

Front Matter, Table of Contents, Preface, Conference Organization

Cite as

29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{navarro_et_al:LIPIcs.CPM.2018.0,
  author =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)},
  pages =	{0:i--0:xvi},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-074-3},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{105},
  editor =	{Navarro, Gonzalo and Sankoff, David and Zhu, Binhai},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.0},
  URN =		{urn:nbn:de:0030-drops-86849},
  doi =		{10.4230/LIPIcs.CPM.2018.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2016.37

On the General Chain Pair Simplification Problem

Authors: Chenglin Fan, Omrit Filtser, Matthew J. Katz, and Binhai Zhu

Published in: LIPIcs, Volume 58, 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016)

Abstract

The Chain Pair Simplification problem (CPS) was posed by Bereg et al., who were motivated by the problem of efficiently computing and visualizing the structural resemblance between a pair of protein backbones. In this problem, given two polygonal chains of lengths n and m, the goal is to simplify both of them simultaneously, so that the lengths of the resulting simplifications as well as the discrete Fréchet distance between them are bounded. When the vertices of the simplifications are arbitrary (i.e., not necessarily from the original chains), the problem is called General CPS (GCPS). In this paper we consider for the first time the complexity of GCPS under both the discrete Fréchet distance (GCPS-3F) and the Hausdorff distance (GCPS-2H). (In the former version, the quality of the two simplifications is measured by the discrete Fréchet distance, and in the latter version it is measured by the Hausdorff distance.) We prove that GCPS-3F is polynomially solvable, by presenting an Õ((n+m)^6 min{n,m}) time algorithm for the corresponding minimization problem. We also present an O((n+m)^4) time 2-approximation algorithm for the problem. On the other hand, we show that GCPS-2H is NP-complete, and present an approximation algorithm for the problem.
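
For readers unfamiliar with the discrete Fréchet distance that bounds the two simplifications, the standard textbook dynamic program is sketched below in Python. It computes the distance between two given chains only; it is not the GCPS algorithm of the paper, and the function name and the representation of chains as lists of coordinate tuples are assumptions of this example.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between non-empty point sequences p and q.

    d[i][j] is the distance between the prefixes p[:i+1] and q[:j+1].
    Runs in O(len(p) * len(q)) time.
    """
    n, m = len(p), len(q)
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = math.dist(p[i], q[j])  # Euclidean distance (Python 3.8+)
            if i == 0 and j == 0:
                d[i][j] = cost
            elif i == 0:
                d[i][j] = max(d[0][j - 1], cost)
            elif j == 0:
                d[i][j] = max(d[i - 1][0], cost)
            else:
                reach = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
                d[i][j] = max(reach, cost)
    return d[n - 1][m - 1]
```

Two parallel horizontal chains one unit apart, such as [(0, 0), (1, 0), (2, 0)] and [(0, 1), (1, 1), (2, 1)], have discrete Fréchet distance 1.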

Cite as

Chenglin Fan, Omrit Filtser, Matthew J. Katz, and Binhai Zhu. On the General Chain Pair Simplification Problem. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 58, pp. 37:1-37:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

@InProceedings{fan_et_al:LIPIcs.MFCS.2016.37,
  author =	{Fan, Chenglin and Filtser, Omrit and Katz, Matthew J. and Zhu, Binhai},
  title =	{{On the General Chain Pair Simplification Problem}},
  booktitle =	{41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016)},
  pages =	{37:1--37:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-016-3},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{58},
  editor =	{Faliszewski, Piotr and Muscholl, Anca and Niedermeier, Rolf},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2016.37},
  URN =		{urn:nbn:de:0030-drops-64510},
  doi =		{10.4230/LIPIcs.MFCS.2016.37},
  annote =	{Keywords: chain simplification, discrete Frechet distance, dynamic programming, geometric arrangements, protein structural resemblance}
}

Document

DOI: 10.4230/LIPIcs.CPM.2016.15

Genomic Scaffold Filling Revisited

Authors: Haitao Jiang, Chenglin Fan, Boting Yang, Farong Zhong, Daming Zhu, and Binhai Zhu

Published in: LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

Abstract

The genomic scaffold filling problem has attracted a lot of attention recently. The problem is to fill an incomplete sequence (scaffold) I into I', with respect to a complete reference genome G, such that the number of adjacencies between G and I' is maximized. The problem is NP-complete and APX-hard, and admits a 1.2-approximation. However, the sequence input I is not quite practical and does not fit most of the real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering two important cases: (1) a scaffold S is given, the missing genes X = c(G) - c(S) can only be inserted in between the contigs, and the objective is to maximize the number of adjacencies between G and the filled S'; and (2) a scaffold S is given, a subset of the missing genes X' ⊂ X = c(G) - c(S) can only be inserted in between the contigs, and the objective is still to maximize the number of adjacencies between G and the filled S''. For problem (1), we present a simple NP-completeness proof, we then present a factor-2 greedy approximation algorithm, and finally we show that the problem is FPT when each gene appears at most d times in G. For problem (2), we prove that the problem is W[1]-hard and then we present a factor-2 FPT-approximation for the case when each gene appears at most d times in G.
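
The two quantities in this problem statement, the missing genes X = c(G) - c(S) and the number of adjacencies shared by G and a filled scaffold, can be sketched in Python. Several things here are assumptions made for illustration and may differ from the paper's precise definitions: genes are single characters, an adjacency is an unordered pair of consecutive genes, shared adjacencies are counted by multiset intersection, and the function names are hypothetical.

```python
from collections import Counter


def missing_genes(genome, contigs):
    """Multiset X = c(G) - c(S) of genes in the genome but absent from the contigs."""
    present = Counter(gene for contig in contigs for gene in contig)
    return Counter(genome) - present


def adjacencies(seq):
    """Multiset of unordered pairs of consecutive genes in seq."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))


def common_adjacencies(genome, filled):
    """Number of adjacencies shared by the genome and a filled scaffold."""
    shared = adjacencies(genome) & adjacencies(filled)
    return sum(shared.values())
```

For G = `"abcd"` and the contigs `["ab", "d"]`, the only missing gene is `c`; inserting it between the two contigs gives `"abcd"` with three shared adjacencies.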

Cite as

Haitao Jiang, Chenglin Fan, Boting Yang, Farong Zhong, Daming Zhu, and Binhai Zhu. Genomic Scaffold Filling Revisited. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 15:1-15:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

@InProceedings{jiang_et_al:LIPIcs.CPM.2016.15,
  author =	{Jiang, Haitao and Fan, Chenglin and Yang, Boting and Zhong, Farong and Zhu, Daming and Zhu, Binhai},
  title =	{{Genomic Scaffold Filling Revisited}},
  booktitle =	{27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)},
  pages =	{15:1--15:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-012-5},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{54},
  editor =	{Grossi, Roberto and Lewenstein, Moshe},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2016.15},
  URN =		{urn:nbn:de:0030-drops-60791},
  doi =		{10.4230/LIPIcs.CPM.2016.15},
  annote =	{Keywords: Computational biology, Approximation algorithms, FPT algorithms, NP-completeness}
}

Search Results

Documents authored by Zhu, Binhai
